Reproducible Research: Principles, Practices, and Tools for Generating Reproducible Statistical Analyses and Reports

CSTAT webinar series on Responsible and Ethical Conduct of Research, East Lansing, MI

Steven J. Pierce

Center for Statistical Training and Consulting, Michigan State University

2025-03-06

Outline

  • What do you mean by reproducible?
  • Why should we aim for reproducibility?
  • How do we achieve reproducibility?
  • Demonstration

Reproducible Research (RR)

… is achieved when investigators share all the materials required to exactly recreate the findings so that others can verify them or conduct alternative analyses.

Statistical Results Can Be:

Repeatable

  • Original analyst
  • Original data

Reproducible

  • New analyst
  • Original data

Replicable

  • New analyst
  • New data

RR is a Product

[Diagram: Quantitative Methods, Qualitative Methods, and Mixed Methods all lead to Reproducible Research.]

RR is a product of how we work, not which methods we use.

Methods Matter!

[Diagram: a continuum of trustworthiness & credibility running from low to high.]

Irreproducible < Reproducible < Replicated

Important

Reproducibility is an attainable minimum standard for science [1].

Funders Value RR

Data sharing and reproducibility initiatives

Emerging Scientific Norms

Publishing Technology

[Diagram: Under print distribution, the journal published only the manuscript. Online distribution is easy and low cost with no page limits, so the journal can publish the manuscript & supplemental files while the author deposits codebooks, data, & software in an archive.]

Career Benefits of RR

  • Motivation to focus on quality
  • Become more efficient
  • Create more products
  • Easier to get published
  • Get cited more often
  • Build your reputation

Materials Required to Recreate Findings

Materials | Findings
Manuals & procedures | Statistics
Instruments & scoring rules | Coefficients & p-values
Codebooks | Confidence intervals
Methods applied | Effect sizes
Data mgt decisions | Model fit indices
Data files | Figures
Software & analysis scripts | Tables

Principles for Achieving Reproducibility

  • Collaboration
  • Organization
  • Automation
  • Preservation
  • Integration
  • Separation

Workflow Woes

  • Using GUIs, menus, & dialog boxes to do tasks
  • Manually updating data files
  • Version control by saving to new file names
  • Disorganized folders & files
  • No audit trail
  • Copying & pasting output from stats software into a word processor
  • Fixing mistakes takes lots of time

Dynamic Documents Via Quarto + R

[Diagram: Report.qmd, a Quarto script with R code, reads the raw data file Study_Data.csv, cites entries in the BibTeX file references.bib, and uses the Citation Style Language file apa.csl; rendering it creates the formatted output Report.pdf.]
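The connections in the diagram above are declared in the YAML header of the Quarto script. A minimal sketch (the title is hypothetical; `format`, `bibliography`, and `csl` are standard Quarto options):

```yaml
---
title: "MyStudy Report"
format: pdf
bibliography: references.bib   # citations drawn from the BibTeX data file
csl: apa.csl                   # citation style applied when rendering
---
```

Running `quarto render Report.qmd` then produces Report.pdf with the data read, code executed, and references formatted in one step.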

GitHub

[Diagram: OurProject, the remote main repository on the GitHub server, is cloned to local repositories on Computer 1 and Computer 2; each local repository pulls from and pushes to the remote.]
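The round trip in the diagram can be sketched with ordinary Git commands. Here a local bare repository stands in for the GitHub server (an assumption for illustration; in practice the clone URL would point at github.com):

```shell
# Simulate the remote main repository with a local bare repo
# (stand-in for the GitHub server; paths are illustrative).
cd "$(mktemp -d)"
git init --bare --quiet server/OurProject.git

# Computer 1: clone, commit, push.
git clone --quiet server/OurProject.git computer1 2>/dev/null
cd computer1
git -c user.name=Demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "Add analysis script"
git push --quiet origin HEAD
cd ..

# Computer 2: clone (or later, pull) to receive the same history.
git clone --quiet server/OurProject.git computer2 2>/dev/null
git -C computer2 log --oneline   # the commit from Computer 1 is now here
```

Every clone carries the full version history, so the audit trail travels with the project.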

Example Folder Structure

MyStudy/            [Compendium, Git repository, R package, RStudio project]
  - .git/           [Hidden folder, holds Git tracking database]
  - .Rproj.user/    [Hidden folder, holds RStudio temporary files]
  - data/           [Holds R data files created by scripts]
  - man/            [Holds R help files for package & custom functions]
  - R/              [Holds R scripts w/ custom functions]
  - scripts/        [Quarto project, holds dynamic documents]  
    - extdata/      [Holds external data files to be imported]
    - output/       [Holds rendered output]
  - .gitignore      [Tells Git what to omit from tracking]
  - DESCRIPTION     [R package meta-data]
  - MyStudy.Rproj   [RStudio project file & settings]
  - NEWS.md         [News for users re: changes to package]
  - README.Rmd      [Dynamic document, creates README.md]
  - README.md       [Rendered output, R package documentation]

Organization Tips

  • Add documentation!
  • Use file naming conventions to link scripts & output
  • Use data flow diagrams to show relations between files
  • Use a primary rendering script to run other scripts
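A primary rendering script can be sketched with the quarto R package (assuming it is installed; the file names follow the example data flow diagram below):

```r
# Render.qmd chunk sketch: run the other scripts in dependency order.
library(quarto)
quarto_render("scripts/Cleaning.qmd")  # reads Raw_Data.csv, creates Clean_Data.RData
quarto_render("scripts/Report.qmd")    # reads Clean_Data.RData, creates Report.pdf
quarto_render("scripts/Slides.qmd")    # reads Clean_Data.RData, creates slides
```

One command then regenerates every deliverable from raw data, which is the whole point of automation.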

Example Data Flow Diagram

[Diagram: Render.qmd runs Cleaning.qmd, Report.qmd, and Slides.qmd, and creates Render.html. Cleaning.qmd reads Raw_Data.csv and creates Cleaning.html and Clean_Data.RData. Report.qmd and Slides.qmd both read Clean_Data.RData, creating Report.pdf and Slides.ppt respectively.]

Preservation Tips

What should you preserve?

  • Data and meta-data
  • Methodology decisions & rationales
  • Scripts (data cleaning, management, & analysis)
  • Version histories of data, code, & output
  • Software environment & versions used
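In R, the software environment can be captured with base functions at the end of each script; a minimal sketch (the output path follows the example folder structure):

```r
# Record R version, OS, and loaded package versions in the rendered output.
sessionInfo()

# Or save the same information alongside the other preserved materials.
writeLines(capture.output(sessionInfo()), "scripts/output/session_info.txt")
```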

Separation Tips

Separate This | From That
Raw Data | Cleaned Data
Code | Data
Code | Output
Data Management | Data Analysis
Draft Outputs | Final Deliverables
Current Files | Version History

Quarto Scripts: Anatomy

  • Plain text files with names ending in .qmd
  • Use markdown syntax to format rendered output
  • YAML header contains metadata, options, & parameters
  • Body contains the main content
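Putting those pieces together, a minimal .qmd sketch (the title, chunk label, and citation key are hypothetical):

````markdown
---
title: "MyStudy Report"
author: "Steven J. Pierce"
format: html
---

# Methods

Markdown text formats the rendered output, and citations like
[@peng2006] pull entries from the bibliography.

```{r}
#| label: describe-data
summary(mtcars)   # R code in chunks runs when the file is rendered
```
````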

MyStudy Demonstration Files

Let’s take a look at actual dynamic documents!

  • scripts/MyStudy_Report.qmd
  • scripts/output/MyStudy_Report.html

Bonus Resources

References

1. Peng, R. D., Dominici, F., & Zeger, S. L. (2006). Reproducible epidemiologic research. American Journal of Epidemiology, 163(9), 783–789. https://doi.org/10.1093/aje/kwj093
2. Bosnjak, M., Fiebach, C. J., Mellor, D., Mueller, S., O’Connor, D. B., Oswald, F. L., & Sokol, R. I. (2022). A template for preregistration of quantitative research in psychology: Report of the Joint Psychological Societies Preregistration Task Force. American Psychologist, 77(4), 602–615. https://doi.org/10.1037/amp0000879
3. DeCoster, J., Sparks, E. A., Sparks, J. C., Sparks, G. G., & Sparks, C. W. (2015). Opportunistic biases: Their origins, effects, and an integrated solution. American Psychologist, 70(6), 499–514. https://doi.org/10.1037/a0039191
4. Moore, D. A. (2016). Preregister if you want to. American Psychologist, 71(3), 238–239. https://doi.org/10.1037/a0040195
5. Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA publications and communications board task force report. American Psychologist, 73(1), 3–25. https://doi.org/10.1037/amp0000191
6. Moher, D., Hopewell, S., Schulz, K. F., Montori, V., Gøtzsche, P. C., Devereaux, P. J., Elbourne, D., Egger, M., & Altman, D. G. (2010). CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomized trials. Journal of Clinical Epidemiology, 63(8), e1–e37. https://doi.org/10.1016/j.jclinepi.2010.03.004
7. Schulz, K. F., Altman, D. G., & Moher, D. (2010). CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. Journal of Clinical Epidemiology, 63(8), 834–840. https://doi.org/10.1016/j.jclinepi.2010.02.005
8. von Elm, E., Altman, D. G., Egger, M., Pocock, S. J., Gøtzsche, P. C., & Vandenbroucke, J. P. (2007). The strengthening the reporting of observational studies in epidemiology (STROBE) statement: Guidelines for reporting observational studies. PLoS Medicine, 4(10), e296. https://doi.org/10.1371/journal.pmed.0040296
9. Hrynaszkiewicz, I., & Altman, D. G. (2009). Towards agreement on best practice for publishing raw clinical trial data. Trials, 10(1), 17. https://doi.org/10.1186/1745-6215-10-17
10. Hrynaszkiewicz, I., Norton, M. L., Vickers, A. J., & Altman, D. G. (2010). Preparing raw clinical data for publication: Guidance for journal editors, authors, and peer reviewers. Trials, 11(1), 9. https://doi.org/10.1186/1745-6215-11-9
11. Laine, C., Goodman, S. N., Griswold, M. E., & Sox, H. C. (2007). Reproducible research: Moving toward research the public can really trust. Annals of Internal Medicine, 146(6), 450–453. https://doi.org/10.7326/0003-4819-146-6-200703200-00154
12. Peng, R. D. (2009). Reproducible research and biostatistics. Biostatistics, 10(3), 405–408. https://doi.org/10.1093/biostatistics/kxp014
13. R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4.1) [Computer Program]. R Foundation for Statistical Computing. https://www.R-project.org/
14. RStudio Team. (2024). RStudio Desktop: Integrated development environment for R (Version 2024.09.0+375) [Computer Program]. Posit Software, PBC. https://posit.co
15. Allaire, J. J., Dervieux, C., Scheidegger, C., Teague, C., & Xie, Y. (2024). Quarto (Version 1.5.57) [Computer Program]. Posit Software, PBC. https://quarto.org
16. Torvalds, L., Hamano, J. C., & other contributors to the Git Project. (2024). Git for Windows (Version 2.47.0(1)) [Computer Program]. Software Freedom Conservancy. https://git-scm.com
17. Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72(1), 80–88. https://doi.org/10.1080/00031305.2017.1375986
18. Wickham, H., & Bryan, J. (2021). R packages: Organize, test, document, and share your code. O’Reilly Media. https://r-pkgs.org
19. Bryan, J. (2018). Excuse me, do you have a moment to talk about version control? The American Statistician, 72(1), 20–27. https://doi.org/10.1080/00031305.2017.1399928
20. Chacon, S., & Straub, B. (2014). Pro Git. Apress Media. https://git-scm.com/book/en/v2
21. Dauber, D. (2024). R for non-programmers: A guide for social scientists [Electronic Book]. https://r4np.com/
22. Bryan, J., Hester, J., Pileggi, S., & Aja, E. D. (2024). What they forgot to teach you about R. https://rstats.wtf
23. Bryan, J., the STAT 545 TAs, & Hester, J. (n.d.). Happy Git and GitHub for the useR [Web Page]. https://happygitwithr.com

Discussion Time

Thank you for attending!